library(tidyverse)
library(readxl)
library(janitor)

Reading in the data

The Excel file read in this example is analytic_data.xlsx. Replace this with your Excel file.

In this R Markdown file, the data frame is called EXAMPLE_DATA. Replace this with the name you wish to give your data frame.

EXAMPLE_DATA <- read_excel("analytic_data.xlsx")
# Convert every character column to a factor
EXAMPLE_DATA <- EXAMPLE_DATA %>%
  mutate(across(where(is.character), as.factor))

In all of the code below, replace EXAMPLE_DATA with the name of your data frame and use your own variable names. Remember to save any changes by assigning the result to a data frame, either overwriting the old one or creating a new one (here we have used UPDATED_DATA).

Renaming

The convention with rename is new = old.

The janitor package also has a built-in function, clean_names, which is an easy way to clean a whole set of names at once.

UPDATED_DATA <- EXAMPLE_DATA %>%
  rename(NEW_NAME_CAT1 = CATEGORICAL_VARIABLE1,
         NEW_NAME_CAT2 = CATEGORICAL_VARIABLE2)

UPDATED_DATA <- EXAMPLE_DATA %>%
  clean_names()
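
As an illustration of both approaches, here is a minimal sketch on a made-up data frame. The object toy and its column names are hypothetical and only serve to show the new = old convention and the kind of names clean_names produces.

```r
library(dplyr)
library(tibble)
library(janitor)

# Hypothetical toy data with awkward column names
toy <- tibble(
  `Patient ID` = 1:2,
  `Blood Pressure` = c(120, 135)
)

# rename() uses new = old; backticks are needed for names containing spaces
renamed <- toy %>%
  rename(patient = `Patient ID`)

# clean_names() converts every name to snake_case in one step
cleaned <- toy %>%
  clean_names()

names(renamed)
names(cleaned)
```

With rename, columns that are not mentioned keep their original names, whereas clean_names changes every column.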

Selecting columns

We can select columns by name, by column number, or by characteristics such as a pattern in the name.

UPDATED_DATA <- EXAMPLE_DATA %>%
  select(contains("VARIABLE")) %>%
  select(CATEGORICAL_VARIABLE1:NUMERICAL_VARIABLE3) %>%
  select(c(1,4:5))
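
The chained example above applies three selections one after another. As a sketch of what each style does on its own, the following uses a hypothetical toy data frame (the object and column names are made up for illustration).

```r
library(dplyr)
library(tibble)

# Hypothetical toy data; column names are illustrative only
toy <- tibble(
  id = 1:3,
  group = c("A", "B", "A"),
  score_pre = c(10, 12, 9),
  score_post = c(14, 13, 11)
)

by_name     <- toy %>% select(id, group)              # explicit names
by_range    <- toy %>% select(group:score_post)       # contiguous range of columns
by_position <- toy %>% select(c(1, 3:4))              # column numbers
by_helper   <- toy %>% select(starts_with("score"))   # pattern in the name
by_type     <- toy %>% select(where(is.numeric))      # column type
```

Selecting by name or helper is usually more robust than selecting by position, because positions change when columns are added or reordered.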

Selecting rows

We can use filter to select rows based on a condition (or series of conditions) and slice to select by row number.

UPDATED_DATA <- EXAMPLE_DATA %>%
  filter(CATEGORICAL_VARIABLE1 == "A") %>%
  slice(1:5)
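
To illustrate a series of conditions, here is a minimal sketch on a hypothetical toy data frame (the object and column names are assumptions made for the example).

```r
library(dplyr)
library(tibble)

# Hypothetical toy data with one missing value
toy <- tibble(
  group = c("A", "B", "A", "A", "B"),
  value = c(5, 8, NA, 12, 3)
)

# Comma-separated conditions are combined with AND.
# Rows where the condition evaluates to NA are dropped.
both <- toy %>% filter(group == "A", value > 6)

# Use | for OR
either <- toy %>% filter(group == "B" | value > 10)

# Use %in% to match any of several values
in_set <- toy %>% filter(group %in% c("A", "B"))

# slice() picks rows by position, regardless of their contents
first_two <- toy %>% slice(1:2)
```

Note that filter silently removes rows where the condition is NA, so it is worth checking for missing values before filtering on a variable.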

Mutating

This may include mutating numerical variables or recoding factors. A range of functions for working with factors can be found in the forcats package.

UPDATED_DATA <- EXAMPLE_DATA %>%
  mutate(NEW_NUMERICAL_VARIABLE = log(NUMERICAL_VARIABLE1 + NUMERICAL_VARIABLE2),
         NEW_CATEGORICAL_VARIABLE2 = fct_recode(CATEGORICAL_VARIABLE2, "Yes" = "Y", "No" = "N"))
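
The same two kinds of mutation can be tried on a small made-up data frame. This is only a sketch: toy and its columns are hypothetical. Note that fct_recode follows the same new = old convention as rename.

```r
library(dplyr)
library(tibble)
library(forcats)

# Hypothetical toy data
toy <- tibble(
  x1 = c(1, 2, 3),
  x2 = c(4, 5, 6),
  answer = factor(c("Y", "N", "Y"))
)

updated <- toy %>%
  mutate(
    # New numerical variable computed from two existing ones
    log_total = log(x1 + x2),
    # Recode factor levels: new label on the left, old level on the right
    answer_label = fct_recode(answer, "Yes" = "Y", "No" = "N")
  )

levels(updated$answer_label)
```

The original columns are kept alongside the new ones, which makes it easy to check that the recoding did what was intended.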

Missing data

Use na_if to convert a specific value (here, "N") to NA, R's built-in missing data code.

UPDATED_DATA <- EXAMPLE_DATA %>%
  mutate(NEW_CATEGORICAL_VARIABLE2 = na_if(CATEGORICAL_VARIABLE2, "N"))
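
A common use of na_if is to recode placeholder values that were entered to mean "missing". The sketch below assumes a hypothetical data frame in which -99 and "unknown" were used as placeholders; these codes and the column names are made up for illustration. It also uses drop_na from tidyr, which is not shown elsewhere in this file.

```r
library(dplyr)
library(tibble)
library(tidyr)

# Hypothetical toy data using -99 and "unknown" as missing-value placeholders
toy <- tibble(
  score = c(12, -99, 15, -99),
  status = c("yes", "unknown", "no", "yes")
)

# Convert the placeholders to NA
recoded <- toy %>%
  mutate(
    score = na_if(score, -99),
    status = na_if(status, "unknown")
  )

# Count the missing values in each column
missing_counts <- recoded %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# Keep only the rows with no missing values
complete_rows <- recoded %>% drop_na()
```

Once the placeholders are NA, functions such as mean can handle them consistently through na.rm = TRUE.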

Same operation across multiple columns

UPDATED_DATA <- EXAMPLE_DATA %>%
  summarise(across(NUMERICAL_VARIABLE1:NUMERICAL_VARIABLE3, ~mean(., na.rm = TRUE)))
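
The across function works inside mutate as well as summarise. The following sketch, on a hypothetical toy data frame, contrasts the two: summarise collapses the data to one row, while mutate keeps every row. The .names argument shown here is a standard across argument for naming the new columns.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data with one missing value
toy <- tibble(
  x = c(1, 2, NA),
  y = c(4, 5, 6)
)

# One row: the mean of each column, ignoring missing values
means <- toy %>%
  summarise(across(x:y, ~ mean(.x, na.rm = TRUE)))

# Same number of rows: halved copies added as new columns
halved <- toy %>%
  mutate(across(x:y, ~ .x / 2, .names = "{.col}_half"))
```

Without .names, mutate with across overwrites the original columns, so supplying it is a safe default when the originals are still needed.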

Joining tables

A second data set with a common variable (TIME_VARIABLE) has been created to demonstrate a range of ways to join data sets.

SECOND_EXAMPLE_DATA <- tribble(
  ~TIME_VARIABLE, ~NEW_DATA,
  1, 1.6,
  2, 1.7,
  3, 1.8,
  4, 1.9,
  5, 2.1,
  9, 3.2,
  10, 4.1,
  11, 4.6,
  14, 4.9,
  15, 6.5,
  16, 6.7,
  17, 7.9,
  18, 10.1,
  19, 14.6,
  20, 20.8,
  21, 20.9,
  22, 24.6,
  23, 30.1,
  24, 31.3,
)

UPDATED_DATA <- left_join(EXAMPLE_DATA, SECOND_EXAMPLE_DATA, by = "TIME_VARIABLE")

UPDATED_DATA <- right_join(EXAMPLE_DATA, SECOND_EXAMPLE_DATA, by = "TIME_VARIABLE")

UPDATED_DATA <- full_join(EXAMPLE_DATA, SECOND_EXAMPLE_DATA, by = "TIME_VARIABLE")
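
The three joins differ in which rows they keep. The sketch below uses two small hypothetical tables (the values are made up) whose TIME_VARIABLE values only partly overlap, so the difference shows up in the row counts. It also includes inner_join, which is not used above, for comparison.

```r
library(dplyr)
library(tibble)

# Hypothetical tables: times 2 and 3 appear in both, 1 and 4 in only one
first_tbl <- tibble(
  TIME_VARIABLE = c(1, 2, 3),
  VALUE = c("a", "b", "c")
)
second_tbl <- tibble(
  TIME_VARIABLE = c(2, 3, 4),
  NEW_DATA = c(1.7, 1.8, 1.9)
)

# Keeps every row of first_tbl; NEW_DATA is NA where there is no match
left  <- left_join(first_tbl, second_tbl, by = "TIME_VARIABLE")

# Keeps every row of second_tbl; VALUE is NA where there is no match
right <- right_join(first_tbl, second_tbl, by = "TIME_VARIABLE")

# Keeps every row of both tables
full  <- full_join(first_tbl, second_tbl, by = "TIME_VARIABLE")

# Keeps only the rows that match in both tables
inner <- inner_join(first_tbl, second_tbl, by = "TIME_VARIABLE")

c(left = nrow(left), right = nrow(right), full = nrow(full), inner = nrow(inner))
```

Comparing row counts before and after a join is a quick check that the join kept the rows you expected.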

Converting between long and wide form

UPDATED_DATA <- EXAMPLE_DATA %>%
  # rep(1:10, 2) has length 20, so EXAMPLE_DATA must have exactly 20 rows
  mutate(NEW_ID = rep(1:10, 2)) %>%
  select(NEW_ID, CATEGORICAL_VARIABLE1, NUMERICAL_VARIABLE1) %>%
  pivot_wider(names_from = CATEGORICAL_VARIABLE1, values_from = NUMERICAL_VARIABLE1)

UPDATED_DATA <- UPDATED_DATA %>%
  # A and B are the columns created by pivot_wider from the levels of CATEGORICAL_VARIABLE1
  pivot_longer(A:B,
               names_to = "CATEGORICAL_VARIABLE1",
               values_to = "NUMERICAL_VARIABLE1")
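
The round trip between long and wide form can be seen on a small example. This is a minimal sketch using a hypothetical data frame with three subjects measured in two groups; the object and column names are made up.

```r
library(dplyr)
library(tibble)
library(tidyr)

# Hypothetical long-form data: one row per subject per group
long <- tibble(
  id = rep(1:3, 2),
  group = rep(c("A", "B"), each = 3),
  value = c(1, 2, 3, 4, 5, 6)
)

# Wide form: one row per subject, one column per group
wide <- long %>%
  pivot_wider(names_from = group, values_from = value)

# Back to long form: one row per subject per group
back <- wide %>%
  pivot_longer(A:B, names_to = "group", values_to = "value")
```

Each id and group combination must appear only once in the long data for pivot_wider to produce one value per cell, which is why an identifier such as NEW_ID is created in the code above.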

© Statistical Consulting Centre, University of Melbourne, 2023